I  Introduction


In  classical  mental  test  theory,  the  reliability  and  the  validity  coefficients  of  a  test  are  considered  to 
be  two  essential  topics.  In  modern  mental  test  theory,  or  in  latent  trait  models,  this  is  not  the  case, 
however.  In  particular,  test  validity  is  one  concept  that  has  been  neglected  in  the  context  of  latent  trait 
models. 

Several  types  of  validity  have  been  identified  and  discussed  in  classical  mental  test  theory,  which 
include  content  validity,  construct  validity,  and  criterion-oriented  validity.  Perhaps  we  can  say  that,  in 
modern  mental  test  theory,  both  content  validity  and  construct  validity  are  well  accommodated,  although
they  are  not  explicitly  stated.  If  each  item  is  based  upon  cognitive  processes  that  are  directly  related 
to  the  ability  to  be  measured,  then  the  content  of  the  operationally  defined  latent  variable  behind 
the  examinees’  performances  will  be  validated.  Also  construct  validity  can  be  identified,  with  all  the 
mathematically  sophisticated  structures  and  functions  which  characterize  latent  trait  models  and  which 
classical  mental  test  theory  does  not  provide. 

With respect to criterion-oriented validity, however, latent trait models have so far offered much less than they have for test reliability and the standard error of measurement (cf. Samejima, 1977, 1990). From the scientific point of view, however, we need to confirm that the test indeed measures what it is supposed to measure, even if we have chosen our items carefully with regard to their contents, and even if we are equipped with highly sophisticated mathematics.

In  classical  mental  test  theory,  the  validity  coefficient  is  a  single  number,  i.e.,  the  product-moment 
correlation  coefficient  between  the  test  score  and  the  criterion  variable.  Researchers  tend  to  put  too 
much  faith  in  the  validity  coefficient,  or  in  the  reliability  coefficient,  however.  The  correlation  coefficient 
is  largely  affected  by  the  heterogeneity  of  the  group  of  examinees,  i.e.,  for  a  fixed  test  the  coefficient 
tends  to  be  higher  when  individual  differences  among  the  examinees  in  the  group  are  greater,  and  vice 
versa  (cf.  Samejima,  1977).  Thus  we  must  keep  in  mind  that  so-called  test  validity  represents  the  degree 
of  heterogeneity  in  ability  among  the  examinees  tested,  as  well  as  the  quality  of  the  test  itself. 

By  virtue  of  the  population-free  nature  of  latent  trait  theory,  we  should  be  able  to  find  some  indices 
of  item  validity,  and  of  test  validity,  which  are  not  affected  by  the  group  of  examinees.  The  resulting 
indices  should  not  be  incidental  as  those  in  classical  mental  test  theory  are,  but  truly  be  attributes  of 
the  item  and  the  test  themselves. 

In  the  present  research  an  attempt  has  been  made  to  obtain  such  population-free  measures  of  item 
validity  and  of  test  validity,  which  are  basically  locally  defined. 

II  Performance Function: Regression of the External Criterion Variable on the Latent Variable

Let $\theta$ be ability, or latent trait, which assumes any real number. We assume that there is a set of $n$ test items measuring $\theta$ whose characteristics are known. Let $g$ denote such an item, $k_g$ be a discrete item response to item $g$, and $P_{k_g}(\theta)$ denote the operating characteristic of $k_g$, or the conditional probability assigned to $k_g$, given $\theta$, i.e.,

(2.1)  $P_{k_g}(\theta) = \mathrm{Prob.}[k_g \mid \theta]$ .

We assume that $P_{k_g}(\theta)$ is three-times differentiable with respect to $\theta$. We have for the item response information function

(2.2)  $I_{k_g}(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_{k_g}(\theta)$ ,

and the item information function is defined as the conditional expectation of $I_{k_g}(\theta)$, given $\theta$, so that we can write

(2.3)  $I_g(\theta) = E[I_{k_g}(\theta) \mid \theta] = \sum_{k_g} I_{k_g}(\theta) P_{k_g}(\theta)$ .

In the special case where the item $g$ is scored dichotomously, this item information function is simplified to become

(2.4)  $I_g(\theta) = \left[\frac{\partial}{\partial\theta} P_g(\theta)\right]^2 \left[\{P_g(\theta)\}\{1 - P_g(\theta)\}\right]^{-1}$ ,

where $P_g(\theta)$ is the operating characteristic of the correct answer to item $g$.
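As a minimal numerical sketch of (2.4), assume (this is not stated in the text) that the item follows a two-parameter logistic model; the function names and the parameters `a` (discrimination) and `b` (difficulty) are hypothetical. The derivative is taken numerically so that the formula is applied exactly as written.

```python
import math

def p_correct(theta, a, b):
    """Operating characteristic of the correct answer (assumed logistic form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, h=1e-5):
    """Item information (2.4): squared derivative of P_g over P_g * (1 - P_g)."""
    # Central difference for dP/dtheta, so (2.4) is evaluated term by term.
    dp = (p_correct(theta + h, a, b) - p_correct(theta - h, a, b)) / (2.0 * h)
    p = p_correct(theta, a, b)
    return dp * dp / (p * (1.0 - p))

# Hypothetical item evaluated at its own difficulty level.
info_at_b = item_information(0.5, a=1.2, b=0.5)
```

For this assumed model the information at the difficulty level equals the squared discrimination divided by four, which gives a quick check on the numerical derivative.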

Let $V$ be a response pattern such that

(2.5)  $V = \{k_g\}'$   $g = 1, 2, \ldots, n$ .

The operating characteristic, $P_V(\theta)$, of the response pattern $V$ is defined as the conditional probability of $V$, given $\theta$, and by virtue of local independence we can write

(2.6)  $P_V(\theta) = \prod_{k_g \in V} P_{k_g}(\theta)$ .

The response pattern information function, $I_V(\theta)$, is given by

(2.7)  $I_V(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_V(\theta) = \sum_{k_g \in V} I_{k_g}(\theta)$ ,

and the test information function, $I(\theta)$, is defined as the conditional expectation of $I_V(\theta)$, given $\theta$, and we obtain from (2.2), (2.3), (2.5), (2.6) and (2.7)

(2.8)  $I(\theta) = E[I_V(\theta) \mid \theta] = \sum_V I_V(\theta) P_V(\theta) = \sum_{g=1}^{n} I_g(\theta)$ .


A big advantage of modern mental test theory is that the standard error of estimation can be defined locally by using $[I(\theta)]^{-1/2}$. Unlike in classical mental test theory, this function does not depend upon the population of examinees, but is solely a property of the test itself, as it should be if we call it the standard error, or the reliability, of a test. It is well known that this function provides us with the asymptotic standard deviation of the conditional distribution of the maximum likelihood estimate of $\theta$, given its true value.

It is assumed that there exists an external criterion variable, which can be measured directly or indirectly. This is the situation which is also assumed when we deal with criterion-oriented validity or predictive validity in classical mental test theory.

FIGURE 2-1

Relationships among $\theta$, $\gamma$, $p_a$, $\xi(\gamma \mid \theta)$ and $\zeta(\theta)$.

FIGURE 2-2

Two Hypothetical Performance Functions $\zeta(\theta)$, One of Which Is Not Likely to Be the Case (Solid Line), and the Other Has a Derivative Equal to Zero at One Point of $\theta$ (Dashed Line).


Let $\gamma$ denote the criterion variable, representing the performance in a specific job, etc. We shall consider the conditional density of the criterion performance, given ability, and denote it by $\xi(\gamma \mid \theta)$. The performance function, $\zeta(\theta)$, can be defined as the regression of $\gamma$ on $\theta$, or by taking, say, the 75, 90 or 95 percentile point of each conditional distribution of $\gamma$, given $\theta$. Let $p_a$ denote the probability which is large enough to satisfy us as a confidence level. Thus we can write


where $\bar{\gamma}$ denotes the least upper bound of the criterion variable $\gamma$.

Figure 2-1 illustrates the relationships among $\theta$, $\gamma$, $p_a$, $\xi(\gamma \mid \theta)$ and $\zeta(\theta)$. It may be reasonable to assume that the functional relationship between $\theta$ and $\zeta(\theta)$ is relatively simple, unlike the one illustrated by the solid line in Figure 2-2, i.e., we do not expect $\zeta(\theta)$ to go up and down frequently within a relatively short range of $\theta$. We shall assume that $\zeta(\theta)$ is twice differentiable with respect to $\theta$.

In dealing with an additional dimension or dimensions, i.e., the criterion variable or variables, in latent space, one of the most difficult things is to keep the population-free nature which is characteristic of latent trait models and which is the main feature that distinguishes the theory from classical mental test theory. If we consider the projection of the operating characteristic of a discrete item response on the criterion dimension, for example, then the resulting operating characteristic as a function of $\gamma$ has to be incidental, for it has to be affected by the population distribution of $\theta$.

We need to start from the conditional distribution of $\gamma$, given $\theta$, therefore, which can be conceived of as being intrinsic in the relationship between the two variables, and independent of the population distribution of $\theta$.

We assume that $\zeta(\theta)$ takes on the same value only at a finite or an enumerable number of points of $\theta$. Let $P^*_{k_g}(\zeta)$ be the conditional probability assigned to the discrete response $k_g$, given $\zeta$. We can write

(2.10)  $P^*_{k_g}(\zeta) = \sum_{\zeta(\theta) = \zeta} P_{k_g}(\theta)$ .

III  When $\zeta(\theta)$ Is Strictly Increasing in $\theta$: Simplest Case

[III.1]  Amounts of Item and Test Information for a Fixed Value of $\zeta$

The simplest case is that $\zeta(\theta)$ is strictly increasing in $\theta$. In this case, $\zeta(\theta)$ has a one-to-one correspondence with $\theta$, and (2.10) becomes simplified into the form

(3.1)  $P^*_{k_g}(\zeta) = P_{k_g}(\theta)$ .

If, in addition, $\partial\theta / \partial\zeta$ is finite throughout the entire range of $\theta$, then we obtain

(3.2)

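Let $I^*_{k_g}(\zeta)$ be the item response information function defined as a function of $\zeta$. We can write

(3.3)  $I^*_{k_g}(\zeta) = -\frac{\partial^2}{\partial\zeta^2} \log P^*_{k_g}(\zeta) = -\frac{\partial}{\partial\zeta} \left[ \left\{ \frac{\partial}{\partial\theta} \log P_{k_g}(\theta) \right\} \frac{\partial\theta}{\partial\zeta} \right]$ .

Let $I^*_g(\zeta)$ and $I^*(\zeta)$ be the amounts of information given by a single item $g$ and by the total test, respectively, for a fixed value of $\zeta$. Then we have from (2.3), (2.8) and (3.3)

(3.4)  $I^*_g(\zeta) = E[I^*_{k_g}(\zeta) \mid \zeta] = \sum_{k_g} I^*_{k_g}(\zeta) P^*_{k_g}(\zeta) = I_g(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2$

and

(3.5)  $I^*(\zeta) = \sum_{g=1}^{n} I^*_g(\zeta) = I(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2$ .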
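The step from the item response information on the $\zeta$ scale to the item information $I_g(\theta)(\partial\theta/\partial\zeta)^2$ can be spelled out as below. This is a sketch under assumptions the text does not state explicitly: $\theta$ is twice differentiable in $\zeta$, differentiation and summation over $k_g$ may be interchanged, and $P^*_{k_g}(\zeta) = P_{k_g}(\theta)$ because of the one-to-one correspondence.

```latex
\begin{align*}
I^*_{k_g}(\zeta)
  &= -\frac{\partial}{\partial\zeta}\left[\left\{\frac{\partial}{\partial\theta}
     \log P_{k_g}(\theta)\right\}\frac{\partial\theta}{\partial\zeta}\right] \\
  &= I_{k_g}(\theta)\left(\frac{\partial\theta}{\partial\zeta}\right)^2
     - \left\{\frac{\partial}{\partial\theta}\log P_{k_g}(\theta)\right\}
       \frac{\partial^2\theta}{\partial\zeta^2}, \\
\sum_{k_g}\left\{\frac{\partial}{\partial\theta}\log P_{k_g}(\theta)\right\}P_{k_g}(\theta)
  &= \frac{\partial}{\partial\theta}\sum_{k_g}P_{k_g}(\theta) = 0, \\
I^*_g(\zeta) = \sum_{k_g} I^*_{k_g}(\zeta)\,P^*_{k_g}(\zeta)
  &= I_g(\theta)\left(\frac{\partial\theta}{\partial\zeta}\right)^2 .
\end{align*}
```

The second-derivative term drops out only after taking the conditional expectation, so the simple squared-derivative relationship holds for the item and test information functions but not for the information of an individual response.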

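If we take the square roots of these two information functions defined for $\zeta$, then we obtain

(3.6)  $[I^*_g(\zeta)]^{1/2} = [I_g(\theta)]^{1/2} \frac{\partial\theta}{\partial\zeta}$

and

(3.7)  $[I^*(\zeta)]^{1/2} = [I(\theta)]^{1/2} \frac{\partial\theta}{\partial\zeta}$ .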

Since a certain constancy exists for the square root of the item information function while the same is not true of the original item information function (cf. Samejima, 1979, 1982), $[I^*_g(\zeta)]^{1/2}$ given by (3.6), instead of the original function given by (3.4), may be more useful on some occasions. This will be discussed later in this section, when the validity in selection plus classification is discussed.

[III.2]  Validity in Selection

Suppose that we have a critical value, $\gamma_0$, of the criterion variable, which is needed for succeeding in a specified job, and that we try to accept applicants whose values of the criterion variable are $\gamma_0$ or greater. If our primary purpose of testing is to make an accurate selection of applicants, then (3.6) and (3.7) for $\zeta = \gamma_0$, or their squared values shown by (3.4) and (3.5), indicate item and test validities, respectively. In other words, if for some item formula (3.6) or (3.4) assumes high values at $\zeta = \gamma_0$, then the standard error of estimation of $\zeta$ around $\zeta = \gamma_0$ becomes small, and chances are slim that we misclassify applicants by accepting unqualified persons or by rejecting qualified ones. The same logic applies to the total test by using formula (3.7) or (3.5) instead of (3.6) or (3.4).

It should be noted in (3.6) or in (3.7) that $[I^*_g(\zeta)]^{1/2}$ or $[I^*(\zeta)]^{1/2}$ consists of two factors, i.e., 1) the square root of the item information function $I_g(\theta)$ or that of the test information function $I(\theta)$ and 2) the partial derivative of ability $\theta$ with respect to $\zeta$ at $\zeta = \gamma_0$. These two factors in each formula are independent of each other, i.e., one belongs to the item or to the test and the other to the statistical relationship between $\theta$ and $\gamma$. We also notice that these two factors are in a supplementary relationship, i.e., even if one assumes a small value the other can supplement it in order to make the resulting product large. Thus while it is important to have a large amount of item information, or of test information, it is even more so to have large values of the derivative, $\partial\theta / \partial\zeta$, in the vicinity of $\zeta = \gamma_0$, for this will increase the amount of item information defined with respect to $\zeta$ uniformly in that vicinity, and also that of test information, as is obvious from the right hand sides of (3.6) and (3.7). In other words, it is desirable for the purpose of selection for $\zeta$ to increase slowly in $\theta$ in the vicinity of $\zeta = \gamma_0$.

Since, in general, the same ability $\theta$ has predictive value for more than one kind of job performance, or of potential of achievement, the performance function varies for different criterion variables. Note that neither $[I_g(\theta)]^{1/2}$ nor $[I(\theta)]^{1/2}$ is changed even when the criterion variable is switched. Thus, for a fixed item or test whose amount of information is reasonably large around $\zeta = \gamma_0$, the derivative $\partial\theta / \partial\zeta$ in the vicinity of $\zeta = \gamma_0$ determines the appropriateness of the use of the item or of the test for the purpose of selection with respect to a specific job, etc. If this derivative assumes a high value, then an item or a test which provides us with a medium amount of information may be acceptable for our purpose of selection, while we will need an item or a test whose amount of information is substantially larger if the derivative is low. Also for the same criterion variable $\gamma$ the derivative $\partial\theta / \partial\zeta$ varies for different values of $\gamma_0$, so the appropriateness of an item or of a test depends upon our choice of $\gamma_0$, too.

The above logic also applies to the formulae (3.4) and (3.5), i.e., to the case in which we choose the information functions instead of their square roots, changing $\partial\theta / \partial\zeta$ to its squared value.

It is obvious from (3.4) and (3.6) that we can choose either $I_g(\theta(\gamma_0))$ or $[I_g(\theta(\gamma_0))]^{1/2}$ to use in item selection, for their rank orders across different items are identical, and they equal the rank orders of $I^*_g(\gamma_0)$ as well as those of $[I^*_g(\gamma_0)]^{1/2}$.

[III.3]  Validity in Selection Plus Classification

If we take another standpoint that our purpose of testing is not only to make a right selection of applicants but also to predict the degree of success in the job for each selected individual, then we will need to integrate $[I^*_g(\zeta)]^{1/2}$ and $[I^*(\zeta)]^{1/2}$, respectively, since we must estimate $\zeta$ accurately not only around $\zeta = \gamma_0$ but also for $\zeta > \gamma_0$. If we choose $[I^*_g(\zeta)]^{1/2}$ and $[I^*(\zeta)]^{1/2}$ in preference to their squared values, we will obtain from (3.6) and (3.7)

(3.8)  $\int_{\Omega_\zeta} [I^*_g(\zeta)]^{1/2} \, d\zeta = \int_{\Omega_\theta} [I_g(\theta)]^{1/2} \, d\theta$

and

(3.9)  $\int_{\Omega_\zeta} [I^*(\zeta)]^{1/2} \, d\zeta = \int_{\Omega_\theta} [I(\theta)]^{1/2} \, d\theta$ ,

where $\Omega_\zeta$ and $\Omega_\theta$ indicate the domains of $\zeta$ and $\theta$ for which $\zeta(\theta) > \gamma_0$, respectively. In other words, when our purpose of testing is not only to make an accurate selection among the applicants but also to discriminate their ability accurately for future purposes among those who were accepted with respect to the criterion variable $\gamma$, we need to select items which assume high values of (3.8) instead of (3.6), or a test which provides us with a high value of (3.9) in place of (3.7).

Note that formulae (3.8) and (3.9) imply that we can obtain these two validity measures directly from the original item and test information functions, respectively, i.e., without actually transforming $\theta$ to $\zeta$, as long as we can identify the domain $\Omega_\theta$. This is true for any criterion variable $\gamma$.

Some examples illustrating the values of (3.8) are given in Figure 3-1 for hypothetical items. In the simplest case observed in this section and illustrated in Figures 2-1 and 3-1, these two domains, $\Omega_\theta$ and $\Omega_\zeta$, are provided by the two intervals, $(\theta_0, \infty)$ and $(\gamma_0, \bar{\gamma})$, where

(3.10)  $\theta_0 = \theta(\gamma_0)$

and $\bar{\gamma}$ denotes the least upper bound of $\gamma$.

FIGURE 3-1

Some Examples of the Relationship between $\gamma_0$ and the Item Validity Measure Given by (3.8).

FIGURE 3-2

Relationship between $\gamma_0$ and Item Validity Indicated by (3.8) for Three Hypothetical Dichotomous Items Whose Operating Characteristics for the Correct Answer Are Strictly Increasing with Zero and Unity as Their Asymptotes.

It should be noted that this pair of validity measures depends upon our choice of the critical value $\gamma_0$. If this value is low, i.e., a specified job does not require high levels of competence with respect to the criterion variable $\gamma$, then these validity indices assume high values, and vice versa. It has been pointed out (Samejima, 1979, 1982) that there is a certain constancy in the amount of information provided by a single test item. To give some examples, if an item is dichotomously scored and has a strictly increasing operating characteristic for success with zero and unity as its two asymptotes, then the area under the curve for $[I_g(\theta)]^{1/2}$ equals $\pi$, regardless of the mathematical form of the operating characteristic and its parameter values; if it follows a three-parameter model with the lower asymptote, $c_g$ $(> 0)$, then this area is less than $\pi$ and strictly decreasing in, and solely dependent upon, $c_g$. We can see, therefore, that if our items belong to the first type then the functional relationship between $\gamma_0$ and the item validity measure given by (3.8) will be monotone decreasing, with $\pi$ and zero as its two asymptotes, for each and every item. Figure 3-2 illustrates this relationship for three hypothetical items of this type. As we can see in this figure, the appropriateness of the items changes with $\gamma_0$ in an absolute sense, and also relative to other items, and the rank orders of desirability among the items depend upon our choice of $\gamma_0$.
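The constancy of the area under the square root of the item information function, and the monotone decrease of the measure (3.8) as the cutoff rises, can be checked numerically. The sketch below assumes a logistic item with hypothetical parameters and uses a plain trapezoidal rule with a finite stand-in for the infinite upper limit.

```python
import math

def sqrt_item_information(theta, a, b):
    """Square root of the item information for an assumed logistic item."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * math.sqrt(p * (1.0 - p))

def validity_3_8(theta0, a, b, upper=40.0, steps=20000):
    """Right-hand side of (3.8): integral of sqrt(I_g) over theta > theta0.

    Trapezoidal rule; `upper` is a finite stand-in for infinity.
    """
    h = (upper - theta0) / steps
    total = 0.5 * (sqrt_item_information(theta0, a, b)
                   + sqrt_item_information(upper, a, b))
    for i in range(1, steps):
        total += sqrt_item_information(theta0 + i * h, a, b)
    return total * h

whole_area = validity_3_8(-40.0, a=1.3, b=0.2)  # close to pi
half_area = validity_3_8(0.2, a=1.3, b=0.2)     # close to pi / 2 at theta0 = b
```

Only the cutoff on the ability scale, `theta0`, enters the computation, which reflects the remark that (3.8) can be evaluated without transforming $\theta$ to $\zeta$ once the domain is identified.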

We  can  see  from  (3.8)  that  this  validity  measure  necessarily  assumes  a  high  value  if  an  item  is 
difficult,  and  the  same  applies  to  (3.9)  for  the  total  test.  This  implies  that  these  validity  measures  alone 
cannot  indicate  the  desirability  of  an  item  and  of  a  test  precisely  for  a  specific  population  of  examinees. 
In  selecting  items  or  a  test,  therefore,  it  is  desirable  to  take  the  ability  distribution  of  the  examinees 
into  account,  if  the  information  concerning  the  ability  distribution  of  a  target  population  is  more  or  less 
available.  In  so  doing  we  shall  be  able  to  avoid  choosing  items  which  are  too  difficult  for  the  target 
population  of  examinees. 

Let $f(\theta)$ denote the density function of the ability distribution for a specific population of examinees, and $f^*(\zeta)$ be that of $\zeta$ for the same population. Then we can write

(3.11)  $f^*(\zeta) = f(\theta) \frac{\partial\theta}{\partial\zeta}$ .


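Adopting this as the weight function, from (3.6) and (3.7) we obtain as the validity indices tailored for a specific population of examinees

(3.12)  $\int_{\Omega_\zeta} [I^*_g(\zeta)]^{1/2} f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I_g(\theta)]^{1/2} f(\theta) \frac{\partial\theta}{\partial\zeta} \, d\theta$

and

(3.13)  $\int_{\Omega_\zeta} [I^*(\zeta)]^{1/2} f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I(\theta)]^{1/2} f(\theta) \frac{\partial\theta}{\partial\zeta} \, d\theta$ .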

Thus by using (3.12) and (3.13) instead of (3.8) and (3.9) we shall be able to make appropriate item selection and test selection for a target population or sample, provided that the information concerning its ability distribution is more or less available. Note that, unlike (3.8) and (3.9), we need $\partial\theta / \partial\zeta$ in evaluating these measures given by (3.12) and (3.13). Thus not only are these validity measures specific for the ability distribution of a target population, but also they are heavily dependent upon the functional formula of $\zeta(\theta)$.
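The equality of the two sides of the tailored item index can be verified numerically for a concrete case. Everything specific in this sketch is an assumption made for illustration: a logistic item, a standard normal ability density, the performance function `zeta(theta) = exp(theta / 2)`, the critical value `gamma0`, and the finite upper limit used in place of infinity.

```python
import math

A, B = 1.3, 0.2  # hypothetical item parameters (assumed logistic item)

def sqrt_info(theta):
    p = 1.0 / (1.0 + math.exp(-A * (theta - B)))
    return A * math.sqrt(p * (1.0 - p))

def density(theta):
    """Assumed ability density f(theta): standard normal."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def zeta(theta):
    """Hypothetical strictly increasing performance function."""
    return math.exp(theta / 2.0)

def theta_of(z):
    """Inverse of the hypothetical performance function."""
    return 2.0 * math.log(z)

def dtheta_dzeta(theta):
    return 2.0 / zeta(theta)

def trapezoid(func, lo, hi, steps=20000):
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def left_integrand(z):
    """Criterion scale: sqrt(I*_g(zeta)) times f*(zeta)."""
    t = theta_of(z)
    return (sqrt_info(t) * dtheta_dzeta(t)) * (density(t) * dtheta_dzeta(t))

def right_integrand(t):
    """Ability scale: sqrt(I_g(theta)) * f(theta) * d theta / d zeta."""
    return sqrt_info(t) * density(t) * dtheta_dzeta(t)

gamma0 = 1.0                # hypothetical critical value
theta0 = theta_of(gamma0)   # cutoff on the ability scale
upper = 8.0                 # finite stand-in for the upper end of theta
lhs = trapezoid(left_integrand, gamma0, zeta(upper))
rhs = trapezoid(right_integrand, theta0, upper)
```

The two integrals agree up to quadrature error, and the right-hand form still contains the derivative of $\theta$ with respect to $\zeta$, so the index cannot be computed from the item and the ability density alone.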

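If we choose to use the area under the curve of the information function instead of that of its square root, we obtain from (3.4) and (3.5)

(3.14)  $\int_{\Omega_\zeta} I^*_g(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) \frac{\partial\theta}{\partial\zeta} \, d\theta$

and

(3.15)  $\int_{\Omega_\zeta} I^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) \frac{\partial\theta}{\partial\zeta} \, d\theta$ ,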

respectively. We notice that in this case, unlike those of (3.8) and (3.9), the integrands of the right hand sides of (3.14) and (3.15) are no longer independent of the functional formula of $\zeta(\theta)$. Also when information about the ability distribution of a target population of examinees is more or less available, the “tailored” item and test validity indices become

(3.16)  $\int_{\Omega_\zeta} I^*_g(\zeta) f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) f(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2 d\theta$

and

(3.17)  $\int_{\Omega_\zeta} I^*(\zeta) f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) f(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2 d\theta$ ,

respectively, if we choose to use the information functions instead of their square roots.

Note that, unlike the validity measures for “selection” purposes, in the present situation the rank orders of validity across different items, or different tests, depend upon the choice of the validity index. Thus a question is: which of the formulae, (3.8) or (3.14), and (3.9) or (3.15), are better as the item and the test validity indices for “selection plus classification” purposes? A similar question is also addressed with respect to (3.12) and (3.16), and to (3.13) and (3.17). These are tough questions to answer. While the choice of the square root of the item information function has the advantage of a certain constancy, which has been observed earlier in this subsection, the use of the item information function has the benefit of additivity, i.e., by virtue of (2.8) the sum total of (3.14) over all the items $g$ equals (3.15), and the same relationship holds between (3.16) and (3.17). The answers to these questions are yet to be found.

[III.4]  Validity in Classification

When our purpose of testing is strictly the classification of individuals, as in assigning those people to different training programs, in guidance, etc., (3.8) and (3.9), or (3.14) and (3.15), also serve as the validity measures of an item and of a test, respectively. In this case, we must set $\gamma_0 = \underline{\gamma}$ in defining the domains, $\Omega_\zeta$ and $\Omega_\theta$, where $\underline{\gamma}$ is the greatest lower bound of $\gamma$. Thus the two domains, $\Omega_\zeta$ and $\Omega_\theta$, in these formulae become those of $\zeta$ and $\theta$ for which $\underline{\gamma} < \zeta(\theta) < \bar{\gamma}$. It is obvious that these formulae provide us with the item and the test validity measures, respectively, for the same reason explained in [III.3].

The same logic applies to the “tailored” validity measures provided by (3.12) and (3.13), and by (3.16) and (3.17), when the information concerning the ability distribution of a target population is more or less available.

[III.5]  Computerized Adaptive Testing

The item information function, $I_g(\theta)$, has been used in computerized adaptive testing in selecting an optimal item to tailor a sequential subtest of items for an individual examinee out of the prearranged item pool. A procedure may be to let the computer choose, out of the set of remaining items in the item pool, an item having the highest value of $I_g(\theta)$ at the current estimated value of $\theta$ for the individual examinee, which is based upon his responses to the items that have already been presented to him in sequence.

We notice from (3.4) or (3.6) that this procedure is justified from the standpoint of criterion-oriented validity, for the item which provides us with the greatest item information $I_g(\theta)$ among all the available items in the item pool also gives the greatest values of $I^*_g(\zeta)$ and its square root, at any fixed value of $\theta$.

The amount of test information can be used effectively in the stopping rule of computerized adaptive testing. A procedure may be to terminate the presentation of new items out of the item pool to the individual examinee when $I(\theta)$ has reached an a priori set amount at the current value of his estimated $\theta$.


When we have a specific criterion variable $\gamma$ in mind, it is justified to use an a priori set value of $I^*(\zeta)$ instead of $I(\theta)$. In so doing we can obtain the value of $I(\theta)$ corresponding to the a priori set value of $I^*(\zeta)$ for each $\theta$, through the formula

(3.18)  $I(\theta) = I^*(\zeta) \left(\frac{\partial\zeta}{\partial\theta}\right)^2$ ,

which is obtained from (3.7). Thus it is easy to have the computer handle this situation, provided that we know the functional formula for $\zeta(\theta)$.
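The item selection and stopping rules described in this subsection can be sketched as follows. The logistic item form, the item pool, the performance function `exp(theta / 2)` behind `dzeta_dtheta`, and all function names are hypothetical; the estimation of the examinee's ability after each response is outside the scope of the sketch and `theta_hat` is simply passed in.

```python
import math

def item_information(theta, a, b):
    """Item information of an assumed logistic item."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def dzeta_dtheta(theta):
    """Derivative of a hypothetical performance function exp(theta / 2)."""
    return 0.5 * math.exp(theta / 2.0)

def next_item(theta_hat, pool, used):
    """Index of the unused item with the largest information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

def required_information(theta_hat, target_info_star):
    """I(theta) needed so that the criterion-scale information hits the target."""
    return target_info_star * dzeta_dtheta(theta_hat) ** 2

def should_stop(theta_hat, pool, used, target_info_star):
    """Stop when the administered items reach the required I(theta)."""
    administered = sum(item_information(theta_hat, *pool[i]) for i in used)
    return administered >= required_information(theta_hat, target_info_star)

pool = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0), (1.2, 0.4)]  # hypothetical (a, b)
first = next_item(0.0, pool, used=set())
```

Because the conversion factor depends on the current ability estimate, the amount of $I(\theta)$ required for a fixed target on the criterion scale differs from examinee to examinee.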


IV  Test  Validity  Measures  Obtained  from  More  Accurate 
Minimum  Variance  Bounds 

When $\partial\zeta / \partial\theta = 0$ at some value of $\theta$, as is illustrated by a dashed line in Figure 2-2, $\partial\theta / \partial\zeta$ becomes positive infinity, and so does the item validity measure given by (3.6). This fact raises some doubt, for, while we can see that at such a point of $\zeta$ item validity is high, we must wonder if positive infinity is an adequate measure. It is also obvious from (2.8) that the same will happen to the total test if it includes at least one such item. Our question is: should we search for more meaningful functions than the item and test information functions? This topic will be discussed in this section.

The necessity of the search for a more accurate measure than the test information function becomes more urgent when the performance function, $\zeta(\theta)$, is not strictly increasing in $\theta$, but is, say, only piecewise monotone in $\theta$ with finite $\partial\theta / \partial\zeta$ and differentiable with respect to $\theta$, as is illustrated in Figure 4-1. The illustrated performance function is still simple enough, but indicates the trend that after a certain point of ability the performance level in a specified job decreases. This can happen when the job does not provide enough challenge for persons of very high ability levels.

Since $I^*(\zeta)$ serves as the reciprocal of the conditional variance of the maximum likelihood estimate of $\zeta$ only asymptotically and there exist more accurate minimum variance bounds for any (asymptotically) unbiased estimator (cf. Kendall and Stuart, 1961), we can search for more accurate test validity measures than the one given by (3.7) by using the reciprocals of the square roots of such minimum variance bounds. Details of this topic will be discussed in a separate paper. Here a brief summary related to validity measures will be given.

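Let $J_{rs}(\theta)$ be defined as

(4.1)  $J_{rs}(\theta) = E\left[ \frac{L_V^{(r)}(\theta)}{L_V(\theta)} \, \frac{L_V^{(s)}(\theta)}{L_V(\theta)} \,\Big|\, \theta \right]$   $r, s = 1, 2, \ldots, k$

where

(4.2)  $L_V^{(r)}(\theta) = \frac{\partial^r}{\partial\theta^r} L_V(\theta) = \frac{\partial^r}{\partial\theta^r} P_V(\theta)$ .

FIGURE 4-1

Example of the Performance Function $\zeta(\theta)$ Which Is Piecewise Monotone in $\theta$.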

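Let $J(\theta)$ denote the $(k \times k)$ matrix of the element $J_{rs}(\theta)$, and $J^{-1}_{rs}(\theta)$ be the corresponding element of its inverse matrix, $J^{-1}(\theta)$. Note that when $k = 1$ we can rewrite (4.1) into the form

(4.3)  $J_{kk}(\theta) = J_{11}(\theta) = E\left[ \left\{ \frac{\partial}{\partial\theta} \log L_V(\theta) \right\}^2 \Big|\, \theta \right] = -E\left[ \frac{\partial^2}{\partial\theta^2} \log P_V(\theta) \,\Big|\, \theta \right]$ ,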

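and from this, (2.7) and (2.8) we can see that $J(\theta)$ is a $(1 \times 1)$ matrix whose element is the test information function, $I(\theta)$, itself. A set of improved minimum variance bounds is given by

(4.4)  $\sum_{r=1}^{k} \sum_{s=1}^{k} \zeta^{(r)}(\theta) \, J^{-1}_{rs}(\theta) \, \zeta^{(s)}(\theta)$

(cf. Kendall and Stuart, 1961), where $\zeta^{(s)}(\theta)$ denotes the $s$-th partial derivative of $\zeta(\theta)$ with respect to $\theta$. We obtain, therefore, for a set of new test validity measures

(4.5)  $\left[ \sum_{r=1}^{k} \sum_{s=1}^{k} \gamma_0^{(r)} \, J^{-1}_{rs}(\theta) \, \gamma_0^{(s)} \right]^{-1/2}$ ,

where $\gamma_0^{(s)}$ indicates the $s$-th partial derivative of $\zeta$ with respect to $\theta$ at $\zeta = \gamma_0$.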

The use of this new test validity measure will ameliorate the problems caused by $\partial\zeta / \partial\theta = 0$, if we choose an appropriate $k$. The resulting algorithm will become much more complicated, however, and we must expect a substantially larger amount of CPU time for computing these measures when $k$ is greater than unity. Note that (4.5) equals (3.7) when $k = 1$.

V  Multidimensional  Latent  Space 

When our latent space is multidimensional, a generalisation of the idea given in Section 4 for the unidimensional latent space can be made straightforwardly. We can write

(5.1)  $\theta = \{\theta_u\}'$   $u = 1, 2, \ldots, \eta$ ,

and the performance function $\zeta(\theta)$ becomes a function of $\eta$ independent variables. A minimum variance bound is given by

(5.2)  $\sum_{u=1}^{\eta} \sum_{v=1}^{\eta} \frac{\partial\zeta}{\partial\theta_u} \, I^{-1}_{uv}(\theta) \, \frac{\partial\zeta}{\partial\theta_v}$ ,

where $I^{-1}_{uv}(\theta)$ is the $(u, v)$-element of the inverse matrix of the $(\eta \times \eta)$ symmetric matrix, whose element is given by

(5.3)  $I_{uv}(\theta) = E\left[ \frac{1}{L} \frac{\partial L}{\partial\theta_u} \cdot \frac{1}{L} \frac{\partial L}{\partial\theta_v} \,\Big|\, \theta \right]$ ,

with $L$ abbreviating $L_V(\theta)$, or $P_V(\theta)$. The reciprocal of the square root of (5.2) will provide us with the counterpart of (3.7) for the multidimensional latent space. For $\eta = 2$, the area $\Omega_\theta$ may look like one of the contours illustrated in Figure 5-1, depending upon our choice of $\gamma_0$, taking the axis for $\gamma$ vertical to the plane defined by $\theta_1$ and $\theta_2$.

FIGURE 5-1

Area $\Omega_\theta$ for Different $\gamma_0$'s in Two-Dimensional Latent Space for a Hypothesised Test.

In  a  more  complex  situation  where  both  ability  and  the  criterion  variables  are  multidimensional,  we 
must  consider  the  projection  of  the  item  information  function  on  the  criterion  subspace  from  the  ability 
subspace,  in  order  to  have  the  item  validity  function  for  each  item,  and  then  the  test  validity  function. 
It  is  anticipated  that  we  must  deal  with  a  higher  mathematical  complexity  in  such  a  case.  The  situation 
will  substantially  be  simplified,  however,  if  the  total  set  of  items  consists  of  several  subsets  of  items, 
each  of  which  measures,  exclusively,  a  single  ability  dimension  and  a  single  criterion  dimension. 


VI  Discussion  and  Conclusions 

In  contrast  to  the  progressive  dissolution  of  the  reliability  coefficient  in  classical  mental  test  theory
and  the  replacement  by  the  test  information  function  in  latent  trait  models,  the  issue  of  test  validity 
has  been  more  or  less  neglected  in  modern  mental  test  theory.  The  present  paper  proposes  some 
considerations  about  the  validity  of  a  test  and  of  a  single  item.  Effort  has  been  focused  upon  searching 
for  measures  which  are  population-free,  and  which  will  provide  us  with  local  and  abundant  information 
just  as  the  information  functions  do  in  comparison  with  the  test  reliability  coefficient  in  classical  mental 
test  theory.  In  so  doing,  validity  indices  for  different  purposes  of  testing  and  also  those  which  are 
tailored  for  a  specific  population  of  examinees  are  considered. 

The above considerations for the item and test validities may be just part of many possible approaches. We may still have a long way to go before we discover the most useful measures of the item and test validities. The aim of the present paper is rather to provide stimulation so that researchers will pursue this topic further, taking different approaches.